In this work we introduce a new and richer class of finite
order Markov chain models and address the following model
selection problem: find the Markov model with the minimal set
of parameters (the minimal Markov model) necessary to
represent a source as a Markov chain of finite order. Let $M$
be the order of the chain and $A$ the finite alphabet. To determine
the minimal Markov model, we define an equivalence relation on
the state space $A^M$ such that all the sequences of size $M$ with
the same transition probabilities are put in the same category.
In this way we have one set of $(|A| - 1)$ transition probabilities
for each category, obtaining a model with a minimal number of
parameters. We show that the model can be selected consistently
using the Bayesian information criterion.

1. Introduction. In this work we consider discrete stationary processes over a finite
alphabet $A$. Markov chains of finite order are widely used to model stationary processes
with finite memory. A problem with full Markov chain models of finite order $M$ is that
the number of parameters, $|A|^M (|A| - 1)$, grows exponentially with the order $M$, where $|A|$
denotes the cardinality of the alphabet $A$. Another characteristic is that the class of full Markov
chains is not very rich: once the alphabet $A$ is fixed there is just one model for each order $M$, and
in practical situations a more flexible structure in terms of the number of
parameters may be necessary. For an extensive discussion of these two problems see Bühlmann P. and Wyner
A. [1]. A richer class of finite order Markov models, introduced by Rissanen J. [6] and
Bühlmann P. and Wyner A. [1], is that of the variable length Markov chain (VLMC) models, which
are discussed in Section 2.3. In the VLMC class, each model is identified by a prefix tree $T$
called a context tree. For a given model with context tree $T$, the final number of parameters
of the model is $|T|(|A| - 1)$ and, depending on the tree, this produces a parsimonious model.
Csiszár, I. and Talata, Z. [4] proved that the Bayesian information criterion (BIC) can
be used to consistently choose the VLMC model in an efficient way using the context tree
weighting (CTW) algorithm.

In this paper we introduce a larger class of finite order Markov models, and we address
the problem of model selection inside this class, showing that the model can be selected
consistently using the BIC criterion. In our class, each model is determined by choosing a
partition of the state space. Our class of models includes the full Markov chain models and
the VLMC models, because a context tree can be seen as a particular partition of the state
space (see Example 2.1 for an illustration).
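
To make the difference in model size concrete, the following sketch counts free parameters for the three kinds of model mentioned above. The function names and the sizes used in the calls (an alphabet of 3 symbols, order 3, a context tree with 7 contexts, a partition with 5 classes) are illustrative assumptions chosen only for the arithmetic.

```python
def full_chain_params(alphabet_size, order):
    """Free parameters of a full Markov chain: |A|^M (|A| - 1)."""
    return alphabet_size ** order * (alphabet_size - 1)


def class_model_params(num_classes, alphabet_size):
    """Free parameters when each class carries one set of |A| - 1 probabilities.

    With num_classes = |T| this is the VLMC count |T| (|A| - 1); the same
    count applies to a partition of the state space into num_classes parts.
    """
    return num_classes * (alphabet_size - 1)


# Hypothetical sizes, used only to illustrate the formulas.
print(full_chain_params(3, 3))    # 27 states, 2 free probabilities each
print(class_model_params(7, 3))   # context tree with 7 contexts
print(class_model_params(5, 3))   # partition with 5 classes
```

The point of the comparison is only that the count depends on the number of classes rather than on $|A|^M$.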

In Section 2, we define the minimal Markov models and show, in theorems 2.1 and 2.2, that these
models can be selected consistently. In Section 3 we present two algorithms that
use the results of Section 2 to consistently choose a minimal Markov model for a sample, together with
some simulations. Section 4 contains the conclusions and Section 5 contains the proofs.

2. Minimal Markov models. 

2.1. Notation. Let $(X_t)$ be a discrete time order $M$ Markov chain on a finite alphabet
$A$. Let us call $S = A^M$ the state space. Denote the string $a_m a_{m+1} \ldots a_n$ by $a_m^n$, where
$a_i \in A$, $m \le i \le n$.

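Let $\mathcal{L} = \{L_1, L_2, \ldots, L_K\}$ be a partition of $S$,

(1)    $P(L, a) = \sum_{s \in L} \mathrm{Prob}(X_{t-M}^{t-1} = s, X_t = a), \quad a \in A,\ L \in \mathcal{L};$

(2)    $P(L) = \sum_{s \in L} \mathrm{Prob}(X_{t-M}^{t-1} = s), \quad L \in \mathcal{L}.$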



Let $x_1^n$ be a sample of the process $(X_t)$, $s \in S$, $a \in A$ and $n > M$. We denote by $N_n(s, a)$
the number of occurrences of the string $s$ followed by $a$ in the sample $x_1^n$,

(3)    $N_n(s, a) = \left| \{ t : M < t \le n,\ x_{t-M}^{t-1} = s,\ x_t = a \} \right|;$

the number of occurrences of $s$ in the sample is denoted by $N_n(s)$ and

(4)    $N_n(s) = \left| \{ t : M < t \le n,\ x_{t-M}^{t-1} = s \} \right|.$



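The number of occurrences of elements of $L$ followed by $a$ is given by

(5)    $N_n^{\mathcal{L}}(L, a) = \sum_{s \in L} N_n(s, a), \quad L \in \mathcal{L};$

the accumulated number of $N_n(s)$ for $s$ in $L$ is denoted by

(6)    $N_n^{\mathcal{L}}(L) = \sum_{s \in L} N_n(s), \quad L \in \mathcal{L}.$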
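
The counts in (3) to (6) can be computed in a single pass over the sample. The sketch below is one possible way to do it, assuming the sample is a Python sequence of symbols and states are represented as tuples; the names `transition_counts` and `class_counts` are hypothetical helpers introduced here for illustration.

```python
from collections import Counter


def transition_counts(sample, M):
    """Return (N_sa, N_s): N_sa[(s, a)] as in (3) and N_s[s] as in (4)."""
    N_sa, N_s = Counter(), Counter()
    # 1-based positions M < t <= n correspond to 0-based indices M .. n-1.
    for t in range(M, len(sample)):
        s = tuple(sample[t - M:t])
        N_sa[(s, sample[t])] += 1
        N_s[s] += 1
    return N_sa, N_s


def class_counts(N_sa, N_s, partition, alphabet):
    """Aggregate the counts over each class of the partition, as in (5) and (6)."""
    N_La = {(k, a): sum(N_sa[(s, a)] for s in L)
            for k, L in enumerate(partition) for a in alphabet}
    N_L = {k: sum(N_s[s] for s in L) for k, L in enumerate(partition)}
    return N_La, N_L


# Toy sample over {0, 1} with M = 1 (illustrative data only).
sample = [0, 1, 0, 0, 1, 1, 0, 1]
N_sa, N_s = transition_counts(sample, M=1)
partition = [[(0,)], [(1,)]]
N_La, N_L = class_counts(N_sa, N_s, partition, alphabet=[0, 1])
print(N_La, N_L)
```

Classes are indexed by their position in `partition`, so `N_L[k]` plays the role of $N_n^{\mathcal{L}}(L_{k+1})$.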



2.2. Good partitions of $S$.

Definition 2.1. Let $(X_t)$ be a discrete time order $M$ Markov chain on a finite alphabet
$A$, $S = A^M$ the state space. A partition $\mathcal{L} = \{L_1, L_2, \ldots, L_K\}$ of $S$ is a good partition of $S$
if for each $s, s' \in L$, $L \in \mathcal{L}$,

    $\mathrm{Prob}(X_t = \cdot \mid X_{t-M}^{t-1} = s) = \mathrm{Prob}(X_t = \cdot \mid X_{t-M}^{t-1} = s').$

Remark 2.1. For a discrete time order $M$ Markov chain on a finite alphabet $A$ with
$S = A^M$ the state space, $\mathcal{L} = S$ (each state forming its own class) is a good partition of $S$.

If $\mathcal{L}$ is a good partition of $S$, we define for each category $L \in \mathcal{L}$

(7)    $P(a|L) = \mathrm{Prob}(X_t = a \mid X_{t-M}^{t-1} = s) \quad \forall a \in A,$

where $s$ is some element of $L$. As a consequence, if we write $P(x_1^n) = \mathrm{Prob}(X_1^n = x_1^n)$, we
obtain

(8)    $P(x_1^n) = P(x_1^M) \prod_{L \in \mathcal{L},\, a \in A} P(a|L)^{N_n^{\mathcal{L}}(L, a)}.$

In the same way as Csiszár, I. and Talata, Z. [4], we define our BIC criterion using a
modified maximum likelihood. We call maximum likelihood the maximization of the
second term in equation (8) for the given observation. For the sequence $x_1^n$, it is

(9)    $\mathrm{ML}(\mathcal{L}, x_1^n) = \prod_{L \in \mathcal{L},\, a \in A} \left( \frac{r_n(L, a)}{r_n(L)} \right)^{N_n^{\mathcal{L}}(L, a)},$

where

(10)   $r_n(L, a) = \frac{N_n^{\mathcal{L}}(L, a)}{n}, \quad a \in A,\ L \in \mathcal{L}, \quad \text{and} \quad r_n(L) = \frac{N_n^{\mathcal{L}}(L)}{n}, \quad L \in \mathcal{L}.$

The BIC is given by the next definition.

Definition 2.2. Given a sample $x_1^n$ of the process $(X_t)$, a discrete time order $M$ Markov
chain on a finite alphabet $A$ with $S = A^M$ the state space, and $\mathcal{L}$ a good partition of $S$, the
BIC of the model (9) is given by

    $\mathrm{BIC}(\mathcal{L}, x_1^n) = \ln(\mathrm{ML}(\mathcal{L}, x_1^n)) - \frac{(|A| - 1)|\mathcal{L}|}{2} \ln(n).$

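
As a sketch of how this criterion can be evaluated from class counts, the code below computes $\ln \mathrm{ML}$ and the BIC, using $r_n(L,a)/r_n(L) = N_n^{\mathcal{L}}(L,a)/N_n^{\mathcal{L}}(L)$. The dictionary layout (`N_La[(k, a)]`, `N_L[k]`) and the function names are assumptions made for this illustration, and terms with a zero count are taken to contribute zero.

```python
import math


def log_ml(N_La, N_L):
    """ln ML: sum over (L, a) of N(L, a) * ln(N(L, a) / N(L))."""
    total = 0.0
    for (k, a), c in N_La.items():
        if c > 0:  # convention: a zero count contributes nothing
            total += c * math.log(c / N_L[k])
    return total


def bic(N_La, N_L, alphabet_size, n):
    """BIC = ln ML - (|A| - 1) * (number of classes) / 2 * ln(n)."""
    penalty = (alphabet_size - 1) * len(N_L) / 2 * math.log(n)
    return log_ml(N_La, N_L) - penalty


# Toy counts for two classes over a binary alphabet (illustrative data only).
N_La = {(0, 0): 1, (0, 1): 3, (1, 0): 2, (1, 1): 1}
N_L = {0: 4, 1: 3}
print(bic(N_La, N_L, alphabet_size=2, n=8))
```
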
2.3. Good partitions and context trees. 

Let $(X_t)$ be a finite order Markov chain taking values on $A$ and $T$ a set of sequences of
symbols from $A$ such that no string in $T$ is a suffix of another string in $T$. Let
$d(T) = \max\{ l(s) : s \in T \}$, where $l(s)$ denotes the length of the string $s$, with $l(\emptyset) = 0$ if the
string is the empty string.

Definition 2.3. $T$ is a context tree for the process $(X_t)$ if for any sequence of symbols
in $A$, $x_1^n$ a sample of the process with $n \ge d(T)$, there exists $s \in T$ such that

    $\mathrm{Prob}(X_{n+1} = a \mid X_1^n = x_1^n) = \mathrm{Prob}(X_{n+1} = a \mid X_{n-l(s)+1}^n = s).$

$d(T)$ is the depth of the tree.
The context tree is the minimal state space of the variable length Markov chain (VLMC),
Bühlmann P. and Wyner A. [1]. The context tree of a VLMC with finite depth $M$ defines
a good partition of the space $S = A^M$, as illustrated by the next example.

Example 2.1. Consider a VLMC over the alphabet $A = \{0, 1\}$ with depth $M = 3$ and
contexts

    $\{0\}, \{01\}, \{011\}, \{111\}.$

This context tree corresponds to the good partition $\{L_1, L_2, L_3, L_4\}$ where

    $L_1 = \{000, 100, 010, 110\}$, $L_2 = \{001, 101\}$, $L_3 = \{011\}$ and $L_4 = \{111\}$.
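
The correspondence in Example 2.1 amounts to grouping each state of $A^M$ by the context that is a suffix of it. A minimal sketch follows, assuming single-character symbols and contexts written as strings; `partition_from_contexts` is a hypothetical helper, not part of the paper.

```python
from itertools import product


def partition_from_contexts(contexts, alphabet, M):
    """Group every state of A^M by the unique context that is a suffix of it."""
    classes = {c: [] for c in contexts}
    for state in product(alphabet, repeat=M):
        s = "".join(state)
        matches = [c for c in contexts if s.endswith(c)]
        if len(matches) != 1:
            # A valid context tree of depth <= M matches each state exactly once.
            raise ValueError(f"state {s} matches {len(matches)} contexts")
        classes[matches[0]].append(s)
    return classes


classes = partition_from_contexts(["0", "01", "011", "111"], "01", 3)
for context, states in classes.items():
    print(context, states)
```

Each key of the returned dictionary is a context and each value is the class of states it induces.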



4 J. E. GARCIA AND V. A. GONZALEZ-LOPEZ 

2.4. Smaller good partitions. 

Definition 2.4. Let $\mathcal{L}^{ij}$ denote the partition

    $\mathcal{L}^{ij} = \{L_1, \ldots, L_{i-1}, L_{ij}, L_{i+1}, \ldots, L_{j-1}, L_{j+1}, \ldots, L_K\},$

where $\mathcal{L} = \{L_1, \ldots, L_K\}$ is a good partition of $S$, and for $1 \le i < j \le K$, $L_{ij} = L_i \cup L_j$.

Now we adapt the notation established for the partition $\mathcal{L}$ to the new partition $\mathcal{L}^{ij}$.

Notation 2.1. For $a \in A$ we write

    $P(L_{ij}, a) = P(L_i, a) + P(L_j, a);$
    $P(L_{ij}) = P(L_i) + P(L_j);$

(11)   $N_n^{\mathcal{L}^{ij}}(L_{ij}, a) = N_n^{\mathcal{L}}(L_i, a) + N_n^{\mathcal{L}}(L_j, a);$

(12)   $N_n^{\mathcal{L}^{ij}}(L_{ij}) = N_n^{\mathcal{L}}(L_i) + N_n^{\mathcal{L}}(L_j).$

If $P(\cdot|L_i) = P(\cdot|L_j)$ then $\mathcal{L}^{ij}$ is a good partition and (7) remains valid for $\mathcal{L}^{ij}$; it is only
necessary to replace $\mathcal{L}$ by $\mathcal{L}^{ij}$ in equations (8), (9) and definition (2.2).
In the following theorem, we show that the BIC criterion provides a consistent way of
detecting smaller good partitions.

Theorem 2.1. Let $(X_t)$ be a Markov chain with order $M$ over a finite alphabet $A$, $S =
A^M$ the state space. If $\mathcal{L} = \{L_1, L_2, \ldots, L_K\}$ is a good partition of $S$ and $L_i \ne L_j$, $L_i, L_j \in \mathcal{L}$,
then, eventually almost surely as $n \to \infty$,

    $I_{\{\mathrm{BIC}(\mathcal{L}^{ij}, x_1^n) > \mathrm{BIC}(\mathcal{L}, x_1^n)\}} = 1$

if, and only if,

    $P(a|L_i) = P(a|L_j) \quad \forall a \in A,$

where $I_A$ is the indicator function of $A$, and the partition $\mathcal{L}^{ij}$ is defined from $\mathcal{L}$ by definition
(2.4).

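Next we extract from the previous theorem the relation that we use in practice, in the next section,
to find smaller good partitions.

Definition 2.5. Let $(X_t)$ be a Markov chain of order $M$, with finite alphabet $A$ and state
space $S = A^M$, $x_1^n$ a sample of the process and let $\mathcal{L} = \{L_1, L_2, \ldots, L_K\}$ be a good partition
of $S$,

(13)   $d_{\mathcal{L}}(i, j) = \frac{1}{\ln(n)} \sum_{a \in A} \left\{ N_n^{\mathcal{L}}(L_i, a) \ln\left( \frac{N_n^{\mathcal{L}}(L_i, a)}{N_n^{\mathcal{L}}(L_i)} \right) + N_n^{\mathcal{L}}(L_j, a) \ln\left( \frac{N_n^{\mathcal{L}}(L_j, a)}{N_n^{\mathcal{L}}(L_j)} \right) - N_n^{\mathcal{L}^{ij}}(L_{ij}, a) \ln\left( \frac{N_n^{\mathcal{L}^{ij}}(L_{ij}, a)}{N_n^{\mathcal{L}^{ij}}(L_{ij})} \right) \right\}.$

    $\mathrm{BIC}(\mathcal{L}, x_1^n) - \mathrm{BIC}(\mathcal{L}^{ij}, x_1^n) < 0 \iff d_{\mathcal{L}}(i, j) < \frac{|A| - 1}{2}.$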

Proof. The result follows from equation (14). □

Remark 2.2. The results remain valid if we replace the constant $\frac{|A| - 1}{2}$ by an
arbitrary positive and finite constant $v$ in definition (2.2).

Remark 2.3. Under the assumptions of theorem 2.1, if $P(a|L_i) \ne P(a|L_j)$ for some
$a \in A$, then eventually almost surely as $n \to \infty$, $\mathrm{BIC}(\mathcal{L}, x_1^n) > \mathrm{BIC}(\mathcal{L}^{ij}, x_1^n)$, where
$\mathcal{L}^{ij}$ is given by definition (2.4).
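
The quantity $d_{\mathcal{L}}(i,j)$ of Definition 2.5 depends only on the counts of the two classes being compared and on $n$. The sketch below evaluates it under the assumptions that counts are given as dictionaries indexed by symbol, that $N_n^{\mathcal{L}}(L) = \sum_a N_n^{\mathcal{L}}(L,a)$, and that zero counts contribute zero; `d_merge` is a hypothetical name.

```python
import math


def _n_log_ratio(c, total):
    """c * ln(c / total), with the convention 0 * ln(0) = 0."""
    return c * math.log(c / total) if c > 0 else 0.0


def d_merge(counts_i, counts_j, n):
    """counts_i[a] = N_n(L_i, a), counts_j[a] = N_n(L_j, a); n is the sample size."""
    Ni, Nj = sum(counts_i.values()), sum(counts_j.values())
    total = 0.0
    for a in set(counts_i) | set(counts_j):
        ci, cj = counts_i.get(a, 0), counts_j.get(a, 0)
        total += (_n_log_ratio(ci, Ni) + _n_log_ratio(cj, Nj)
                  - _n_log_ratio(ci + cj, Ni + Nj))
    return total / math.log(n)


# Illustrative counts: identical empirical distributions versus different ones.
same = d_merge({0: 10, 1: 30}, {0: 20, 1: 60}, n=1000)
diff = d_merge({0: 10, 1: 30}, {0: 60, 1: 20}, n=1000)
print(same, diff)
```

With a binary alphabet the merging threshold $(|A|-1)/2$ equals 0.5, so in this toy case the first pair would be merged and the second would not.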

2.5. Minimal good partition. 

We want to find the smallest good partition among all possible good partitions
of $S$. This special good partition can be defined as follows, and it allows the definition of
the most parsimonious model in the class considered in this paper.

Definition 2.6. Let $(X_t)$ be a discrete time order $M$ Markov chain on a finite alphabet
$A$, $S = A^M$ the state space. A partition $\mathcal{L} = \{L_1, L_2, \ldots, L_K\}$ of $S$ is the minimal good
partition of $S$ if, $\forall L \in \mathcal{L}$,

    $s, s' \in L$ if, and only if, $\mathrm{Prob}(X_t = \cdot \mid X_{t-M}^{t-1} = s) = \mathrm{Prob}(X_t = \cdot \mid X_{t-M}^{t-1} = s').$

Remark 2.4. For a discrete time order $M$ Markov chain on a finite alphabet $A$ with
$S = A^M$ the state space, there exists a unique minimal good partition of $S$.

In the next example we emphasize the difference between good partitions and the minimal 
good partition, 

The next theorem shows that for $n$ large enough we achieve the partition $\mathcal{L}^*$ which is the
minimal good partition.

Theorem 2.2. Let $(X_t)$ be a Markov chain with order $M$ over a finite alphabet $A$,
$S = A^M$ the state space and let $\mathcal{P}$ be the set of all the partitions of $S$. Define

    $\mathcal{L}_n^* = \arg\max_{\mathcal{L} \in \mathcal{P}} \{ \mathrm{BIC}(\mathcal{L}, x_1^n) \};$

then, eventually almost surely as $n \to \infty$,

    $\mathcal{L}^* = \mathcal{L}_n^*.$

3. Minimal good partition estimation algorithm. 

Algorithm 3.1. (MMM algorithm for good partitions)
Consider $x_1^n$ a sample of the Markov process $(X_t)$, with order $M$ over a finite alphabet $A$,
$S = A^M$ the state space.

Let $\mathcal{L} = \{L_1, L_2, \ldots, L_K\}$ be a good partition of $S$.

1 For $i = 1, 2, \ldots, K - 1$,
    for $j = i + 1, i + 2, \ldots, K$,
      calculate $d_{\mathcal{L}}(i, j)$ and
      $R_n^{ij} = I_{\{ d_{\mathcal{L}}(i, j) < \frac{|A| - 1}{2} \}}$.

2 If $R_n^{ij} = 1$, define $L_{ij} = L_i \cup L_j$ and $\mathcal{L} = \mathcal{L}^{ij}$. Else $i = i + 1$. Return to step 1.

The algorithm allows us to define the following relation based on the sample $x_1^n$.

Definition 3.1. For $r, s \in S$: $r \sim_n s \iff R_n^{(r),(s)} = 1$.

For $n$ large enough, the algorithm returns the minimal good partition.

Corollary 3.1. Let $\{X_t, t = 0, 1, 2, \ldots\}$ be a Markov chain with order $M$ over a finite
alphabet $A$, $S = A^M$ and $x_1^n$ a sample of the Markov process. $\mathcal{L}_n$, given by the algorithm
(3.1), converges almost surely eventually to $\mathcal{L}^*$, where $\mathcal{L}^*$ is the minimal good partition of $S$.

Proof. Because $K < \infty$, for $n$ large enough, the algorithm returns the minimal good
partition. □

Remark 3.1. In the worst case, which corresponds to an initial good partition equal to
$S$, we need to calculate the term $\frac{N_n(s, a)}{N_n(s)}$ for each $s \in S$, plus $\frac{K(K-1)}{2}$ divisions,
to implement the algorithm (3.1).

The next algorithm is a variation of the first. In this case the partition is grown by selecting
the pair of elements with the minimal value of $d_{\mathcal{L}}(i, j)$; the algorithm stops when there is
no $d_{\mathcal{L}}(i, j)$ lower than $(|A| - 1)/2$.

Algorithm 3.2. Consider $x_1^n$ a sample of the Markov process $(X_t)$, with order $M$ over
a finite alphabet $A$, $S = A^M$ the state space.
Let $\mathcal{L} = \{L_1, L_2, \ldots, L_K\}$ be a good partition of $S$.

1 Calculate

    $(i^*, j^*) = \arg\min_{i, j \,:\, 1 \le i < j \le K} \{ d_{\mathcal{L}}(i, j) \}.$

2 If $d_{\mathcal{L}}(i^*, j^*) < \frac{|A| - 1}{2}$ then $\mathcal{L} = \mathcal{L}^{i^* j^*}$, $K = K - 1$ and return to 1.
Else end.
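
A minimal sketch of the greedy procedure in Algorithm 3.2 is given below, under stated assumptions: class counts are supplied as lists over the alphabet (`counts[k][a]` standing for $N_n^{\mathcal{L}}(L_k, a)$), zero counts contribute zero, and all pairwise values are recomputed at each pass rather than cached. The function names are hypothetical.

```python
import math
from itertools import combinations


def _term(c, total):
    return c * math.log(c / total) if c > 0 else 0.0


def d_merge(ci, cj, n):
    """d(i, j) of Definition 2.5 from per-symbol count lists ci and cj."""
    Ni, Nj = sum(ci), sum(cj)
    return sum(_term(x, Ni) + _term(y, Nj) - _term(x + y, Ni + Nj)
               for x, y in zip(ci, cj)) / math.log(n)


def greedy_merge(classes, counts, n, alphabet_size):
    """Merge the closest pair of classes while its d value is below (|A| - 1) / 2."""
    classes = [list(c) for c in classes]
    counts = [list(c) for c in counts]
    threshold = (alphabet_size - 1) / 2
    while len(classes) > 1:
        # Step 1: pair (i*, j*) with minimal d, i < j.
        i, j = min(combinations(range(len(classes)), 2),
                   key=lambda p: d_merge(counts[p[0]], counts[p[1]], n))
        # Step 2: stop if even the closest pair is not below the threshold.
        if d_merge(counts[i], counts[j], n) >= threshold:
            break
        classes[i] += classes[j]
        counts[i] = [x + y for x, y in zip(counts[i], counts[j])]
        del classes[j], counts[j]
    return classes


# Illustrative input: states "a" and "b" have the same empirical distribution.
result = greedy_merge([["a"], ["b"], ["c"]],
                      [[100, 300], [200, 600], [500, 100]],
                      n=2000, alphabet_size=2)
print(result)
```

Because $i < j$ in every candidate pair, deleting class $j$ after the merge leaves the index of class $i$ unchanged.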

This algorithm is consistent and always returns a partition, but it has a greater computational
cost. Taking into consideration that the cost depends on $K$ and that for a Markov chain of order
$M$ we consider samples of size $n$ such that $\log(n) > M$, the two algorithms 3.1 and 3.2 have
a computational cost that is linear in $n$ (the sample size).

3.1. Dendrograms and MMM algorithm. In practice, when the sample size is not large
enough and the algorithm 3.1 has not converged, it is possible that the algorithm will not
return a partition of $S$, regardless of the value used for $v$. In that case, a better approach
can be to use, for each $r, s \in S$, the function $d_n(r, s)$ as a similarity measure between $r$ and
$s$. Then $d_n(r, s)$ can be used to produce a dendrogram, and the partition defined by
the dendrogram is used as the partition estimator.

Also, in practice it is possible that the maximum number of free parameters in our model
is limited by a number $K$. In that case, the logical choice is to find a value of $d$ in the
dendrogram such that the size of the partition obtained by cutting the dendrogram at $d$ is less
than or equal to $K$; the chosen model is the one defined by that partition.
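
The dendrogram-based choice described above can be imitated without a clustering library. The sketch below is a plain complete-linkage agglomeration that stops once at most `max_classes` clusters remain, which corresponds to cutting the dendrogram at a suitable height. It is a stand-in written for illustration, not the procedure used to produce the results of this paper; the distance values and the name `complete_linkage` are assumptions.

```python
def complete_linkage(labels, dist, max_classes):
    """Agglomerate with complete linkage until at most max_classes clusters remain.

    dist maps frozenset({r, s}) to the dissimilarity between states r and s.
    """
    clusters = [[x] for x in labels]

    def link(c1, c2):
        # Complete linkage: distance between clusters is the largest pairwise value.
        return max(dist[frozenset((r, s))] for r in c1 for s in c2)

    while len(clusters) > max_classes:
        pairs = [(link(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


# Hypothetical dissimilarities between four states.
labels = ["00", "10", "01", "11"]
dist = {frozenset(p): 5.0 for p in
        [("00", "01"), ("00", "11"), ("10", "01"), ("10", "11")]}
dist[frozenset(("00", "10"))] = 0.1
dist[frozenset(("01", "11"))] = 0.2
print(complete_linkage(labels, dist, max_classes=2))
```
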

Example 3.1. Consider a Markov chain of order $M = 3$ on the alphabet $A = \{0, 1, 2\}$
with classes:

    $L_1 = \{000, 100, 200, 010, 110, 210, 020, 120, 220, 022, 122, 222\},$

    $L_2 = \{001, 101, 201, 011, 111, 211, 021, 121, 221\},$

    $L_3 = \{012, 112, 212, 002\},$

    $L_4 = \{102\},$

    $L_5 = \{202\},$

and transition probabilities,

    $P(0|L_1) = 0.2,$    $P(1|L_1) = 0.3,$
    $P(0|L_2) = 0.4,$    $P(1|L_2) = 0.3,$
    $P(0|L_3) = 0.4,$    $P(1|L_3) = 0.1,$
    $P(0|L_4) = 0.1,$    $P(1|L_4) = 0.4,$
    $P(0|L_5) = 0.3,$    $P(1|L_5) = 0.5.$

In this example, $|A| = 3$ so the penalty constant is $\frac{|A| - 1}{2} = 1$. We simulated samples
of sizes $n = 5000$ and $9000$, obtaining the dendrograms in figure 3.1. The dendrogram for the
sample size of 9000 gives the correct partition.
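
A sample from the chain of Example 3.1 can be generated as in the sketch below. Two choices are assumptions of this illustration rather than details given in the text: the chain is started from the fixed state 000 instead of a draw from the stationary distribution, and $P(2|L)$ is obtained as the remainder $1 - P(0|L) - P(1|L)$. The names `CLASSES`, `PROBS` and `simulate` are hypothetical.

```python
import random

CLASSES = {
    "L1": "000 100 200 010 110 210 020 120 220 022 122 222".split(),
    "L2": "001 101 201 011 111 211 021 121 221".split(),
    "L3": "012 112 212 002".split(),
    "L4": ["102"],
    "L5": ["202"],
}
# (P(0|L), P(1|L)); P(2|L) is the remainder 1 - P(0|L) - P(1|L).
PROBS = {"L1": (0.2, 0.3), "L2": (0.4, 0.3), "L3": (0.4, 0.1),
         "L4": (0.1, 0.4), "L5": (0.3, 0.5)}
STATE_TO_CLASS = {s: name for name, states in CLASSES.items() for s in states}


def simulate(n, seed=0, start="000"):
    """Return a string of n symbols; the start state is an arbitrary assumption."""
    rng = random.Random(seed)
    x = list(start)
    while len(x) < n:
        p0, p1 = PROBS[STATE_TO_CLASS["".join(x[-3:])]]
        u = rng.random()
        x.append("0" if u < p0 else "1" if u < p0 + p1 else "2")
    return "".join(x)


sample = simulate(5000)
print(sample[:40])
```
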

3.2. Simulations. We implemented a simulation study for the model described in
example 3.3. More precisely, we simulated 1000 samples of the process for each of the sample
sizes 4000, 6000, 8000 and 10000. For each sample we calculated the values $d_n(r, s)$ and built
the corresponding dendrogram (using the R function hclust with the complete
linkage method). Table 1 shows the results.

Fig 1. The figure shows the dendrograms for the model on example 3.3 estimated using algorithm ?? for
sample sizes of 5000 (upper picture) and 9000 (lower picture).

Table 1

Proportion of errors on the partition estimated for the model on example 3.3

    Sample size    Proportion of errors
    4000           0.801
    6000           0.495
    8000           0.252
    10000          0.161



3.3. Simulations. The VLMC corresponding to the partition on example () has
contexts:

    $T_1 = \{0\},$
    $T_2 = \{1\},$
    $T_3 = \{12\},$
    $T_4 = \{102\},$
    $T_5 = \{202\},$
    $T_6 = \{22\},$
    $T_7 = \{002\}.$



We simulated 1000 samples of the process for each of the sample sizes 4000, 6000, 8000
and 10000. Using the tree as a basic good partition, for each sample we calculated the values
$d_n(L_i, L_j)$ corresponding to the algorithm (3.1) and built the corresponding partitions. Table
(2) shows the results.

Starting from the good partition corresponding to the context tree, the number of possible
models is substantially reduced compared to that in the simulation of section (3.2) and,
because of that, the error rates in this simulation are much lower than before.
Table 2

Proportion of errors on the partition estimated for the model of example (3.3)

    Sample size    Proportion of errors
    4000           0.614
    6000           0.206
    8000           0.047
    10000          0.007



4. Conclusions. Our main motivation for defining the minimal Markov models is, in the
first place, the concept of partitioning the state space into classes in which the states are
equivalent. This allows us to model the redundancy that appears in many natural
processes, as in genetics, linguistics, etc. Each class in the state space has a very specific, clear
and practical meaning: any sequence of symbols in the same class has the same effect on the
future distribution of the process. In other words, they activate the same random mechanism
to choose the next symbol of the process. We can think of the resulting minimal partition
as a list of the relevant contexts for the process and their synonyms.

In the second place, our motivation for developing this methodology is to demonstrate that for
a stationary, finite memory process it is theoretically possible to consistently find a minimal
Markov model to represent the process, and that this can be accomplished in practice. The
practical implication of the fact that the model selection process can be started from a
context tree partition is that minimal Markov models can be easily fitted to stationary
sources where the VLMC models already work.

It is clear that there are applications in which the natural partition to estimate is neither
the minimal nor a context tree partition. As long as the particular properties of the partition are
well defined, we can use theorem 2.1 to estimate the minimal partition satisfying those
properties.

Our theorems remain valid if we replace the constant term in the penalization of the BIC
criterion by any positive (and finite) number. In the case of the VLMC model, the problem
of finding a better constant has been addressed in diverse works, for example Bühlmann
P. and Wyner A. [1] and Galves, A., Galves, C., Garcia N. L. and Leonardi F. [5].



5. Proofs.

Definition 5.1. Let $P$ and $Q$ be probability distributions on $A$. The relative entropy
between $P$ and $Q$ is given by

    $D(P \| Q) = \sum_{a \in A} P(a) \ln\left( \frac{P(a)}{Q(a)} \right).$

5.1. Proof of theorem 2.1.
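As a consequence,

(14)   $\mathrm{BIC}(\mathcal{L}, x_1^n) - \mathrm{BIC}(\mathcal{L}^{ij}, x_1^n) = \sum_{a \in A} \left\{ N_n^{\mathcal{L}}(L_i, a) \ln\left( \frac{r_n(L_i, a)}{r_n(L_i)} \right) + N_n^{\mathcal{L}}(L_j, a) \ln\left( \frac{r_n(L_j, a)}{r_n(L_j)} \right) - N_n^{\mathcal{L}^{ij}}(L_{ij}, a) \ln\left( \frac{r_n(L_{ij}, a)}{r_n(L_{ij})} \right) \right\} - \frac{(|A| - 1)}{2} \ln(n).$

We note that the condition $I_{\{\mathrm{BIC}(\mathcal{L}^{ij}, x_1^n) > \mathrm{BIC}(\mathcal{L}, x_1^n)\}} = 1$ is true if, and only if,

(15)   $\sum_{a \in A} \left\{ r_n(L_i, a) \ln\left( \frac{r_n(L_i, a)}{r_n(L_i)} \right) + r_n(L_j, a) \ln\left( \frac{r_n(L_j, a)}{r_n(L_j)} \right) - r_n(L_{ij}, a) \ln\left( \frac{r_n(L_{ij}, a)}{r_n(L_{ij})} \right) \right\} < \frac{(|A| - 1)}{2n} \ln(n).$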

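Because $r_n(L, a)$ and $r_n(L)$ are non-negative, using Jensen we have that

    $r_n(L_i, a) \ln\left( \frac{r_n(L_i, a)}{r_n(L_i)} \right) + r_n(L_j, a) \ln\left( \frac{r_n(L_j, a)}{r_n(L_j)} \right) \ge \left( r_n(L_i, a) + r_n(L_j, a) \right) \ln\left( \frac{r_n(L_i, a) + r_n(L_j, a)}{r_n(L_i) + r_n(L_j)} \right),$

or equivalently,

(16)   $r_n(L_i, a) \ln\left( \frac{r_n(L_i, a)}{r_n(L_i)} \right) + r_n(L_j, a) \ln\left( \frac{r_n(L_j, a)}{r_n(L_j)} \right) \ge r_n(L_{ij}, a) \ln\left( \frac{r_n(L_{ij}, a)}{r_n(L_{ij})} \right),$

with equality if and only if $\frac{r_n(L_i, a)}{r_n(L_i)} = \frac{r_n(L_j, a)}{r_n(L_j)}$ $\forall a \in A$.

As a consequence of equation (16),

(17)   $\sum_{a \in A} \left\{ r_n(L_i, a) \ln\left( \frac{r_n(L_i, a)}{r_n(L_i)} \right) + r_n(L_j, a) \ln\left( \frac{r_n(L_j, a)}{r_n(L_j)} \right) - r_n(L_{ij}, a) \ln\left( \frac{r_n(L_{ij}, a)}{r_n(L_{ij})} \right) \right\} \ge 0,$

with equality if and only if $\frac{r_n(L_i, a)}{r_n(L_i)} = \frac{r_n(L_j, a)}{r_n(L_j)}$ $\forall a \in A$.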

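Considering that $\frac{(|A| - 1)}{2n} \ln(n) \to 0$ as $n \to \infty$, and from equation (15), we have that if
$\lim_{n \to \infty} I_{\{\mathrm{BIC}(\mathcal{L}^{ij}, x_1^n) > \mathrm{BIC}(\mathcal{L}, x_1^n)\}} = 1$, then

    $\lim_{n \to \infty} \sum_{a \in A} \left\{ r_n(L_i, a) \ln\left( \frac{r_n(L_i, a)}{r_n(L_i)} \right) + r_n(L_j, a) \ln\left( \frac{r_n(L_j, a)}{r_n(L_j)} \right) - r_n(L_{ij}, a) \ln\left( \frac{r_n(L_{ij}, a)}{r_n(L_{ij})} \right) \right\} \le 0;$

from equation (17) and taking the limit inside the sum we obtain

    $\sum_{a \in A} \left\{ P(L_i, a) \ln\left( \frac{P(L_i, a)}{P(L_i)} \right) + P(L_j, a) \ln\left( \frac{P(L_j, a)}{P(L_j)} \right) - P(L_{ij}, a) \ln\left( \frac{P(L_{ij}, a)}{P(L_{ij})} \right) \right\} = 0;$

using Jensen again, this means that $\frac{P(L_i, a)}{P(L_i)} = \frac{P(L_j, a)}{P(L_j)}$ $\forall a \in A$, or equivalently, $P(a|L_i) =
P(a|L_j)$ $\forall a \in A$.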



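For the other half of the proof, suppose that $P(a|L_i) = P(a|L_j)$ $\forall a \in A$; as a consequence
we have that

(18)   $P(a|L_{ij}) = P(a|L_i) \quad \forall a \in A.$

    $\mathrm{BIC}(\mathcal{L}, x_1^n) - \mathrm{BIC}(\mathcal{L}^{ij}, x_1^n) = \ln\left( \prod_{a \in A} \left( \frac{N_n^{\mathcal{L}}(L_i, a)}{N_n^{\mathcal{L}}(L_i)} \right)^{N_n^{\mathcal{L}}(L_i, a)} \right) + \ln\left( \prod_{a \in A} \left( \frac{N_n^{\mathcal{L}}(L_j, a)}{N_n^{\mathcal{L}}(L_j)} \right)^{N_n^{\mathcal{L}}(L_j, a)} \right) - \ln\left( \prod_{a \in A} \left( \frac{N_n^{\mathcal{L}^{ij}}(L_{ij}, a)}{N_n^{\mathcal{L}^{ij}}(L_{ij})} \right)^{N_n^{\mathcal{L}^{ij}}(L_{ij}, a)} \right) - \frac{(|A| - 1)}{2} \ln(n).$

Now, considering that $\frac{N_n^{\mathcal{L}^{ij}}(L_{ij}, a)}{N_n^{\mathcal{L}^{ij}}(L_{ij})}$ is the maximum likelihood estimator of $P(a|L_{ij})$,

    $\prod_{a \in A} \left( \frac{N_n^{\mathcal{L}^{ij}}(L_{ij}, a)}{N_n^{\mathcal{L}^{ij}}(L_{ij})} \right)^{N_n^{\mathcal{L}^{ij}}(L_{ij}, a)} \ge \prod_{a \in A} P(a|L_{ij})^{N_n^{\mathcal{L}^{ij}}(L_{ij}, a)},$

$\mathrm{BIC}(\mathcal{L}, x_1^n) - \mathrm{BIC}(\mathcal{L}^{ij}, x_1^n)$ is bounded above by

    $\ln\left( \prod_{a \in A} \left( \frac{N_n^{\mathcal{L}}(L_i, a)}{N_n^{\mathcal{L}}(L_i)} \right)^{N_n^{\mathcal{L}}(L_i, a)} \right) + \ln\left( \prod_{a \in A} \left( \frac{N_n^{\mathcal{L}}(L_j, a)}{N_n^{\mathcal{L}}(L_j)} \right)^{N_n^{\mathcal{L}}(L_j, a)} \right) - \ln\left( \prod_{a \in A} P(a|L_{ij})^{N_n^{\mathcal{L}^{ij}}(L_{ij}, a)} \right) - \frac{(|A| - 1)}{2} \ln(n)$

    $= N_n^{\mathcal{L}}(L_i)\, D\left( \frac{N_n^{\mathcal{L}}(L_i, \cdot)}{N_n^{\mathcal{L}}(L_i)} \,\Big\|\, P(\cdot|L_i) \right) + N_n^{\mathcal{L}}(L_j)\, D\left( \frac{N_n^{\mathcal{L}}(L_j, \cdot)}{N_n^{\mathcal{L}}(L_j)} \,\Big\|\, P(\cdot|L_j) \right) - \frac{(|A| - 1)}{2} \ln(n).$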



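Here $D(p \| q)$ is the relative entropy, given by definition (5.1). The first equality comes
from (18) and (11). Using proposition (??), proposition (??), for any $\delta > 0$ and $n$ large
enough,

(19)   $D\left( \frac{N_n^{\mathcal{L}}(L, \cdot)}{N_n^{\mathcal{L}}(L)} \,\Big\|\, P(\cdot|L) \right) \le \sum_{a \in A} \frac{\left( \frac{N_n^{\mathcal{L}}(L, a)}{N_n^{\mathcal{L}}(L)} - P(a|L) \right)^2}{P(a|L)}$

(20)   $\le \sum_{a \in A} \frac{\delta \ln(n)}{N_n^{\mathcal{L}}(L)\, P(a|L)}.$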
Then for any $\delta > 0$ and $n$ large enough,

    $\mathrm{BIC}(\mathcal{L}, x_1^n) - \mathrm{BIC}(\mathcal{L}^{ij}, x_1^n) < \frac{2\delta|A|}{p} \ln(n) - \frac{(|A| - 1)}{2} \ln(n) = \left( \frac{2\delta|A|}{p} - \frac{(|A| - 1)}{2} \right) \ln(n).$

    $\mathrm{BIC}(\mathcal{L}, x_1^n) - \mathrm{BIC}(\mathcal{L}^{ij}, x_1^n) <$